These slides are meant to get you thinking about the data you are using: what they tell us and what they don’t, and what we are trying to measure versus what we are actually measuring.
# Anscombe's quartet
library(dplyr)   # %>% and filter()
library(tidyr)   # pivot_longer()
library(knitr)   # kable()

longanscombe <- anscombe %>%
  pivot_longer(cols = everything(),            # pivot all the columns
               cols_vary = "slowest",          # keep the datasets together
               names_to = c(".value", "set"),  # new var names; .value = stem of vars
               names_pattern = "(.)(.)")       # to extract var names

kable(list(anscombe), caption = "Anscombe's Quartet", booktabs = TRUE, valign = 't', row.names = FALSE)
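Anscombe’s Quartet

| x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|----|----|----|----|-------|------|-------|-------|
| 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |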
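# Estimate a separate bivariate regression for each of the four sets
B <- data.frame(set = numeric(0), b0 = numeric(0), b1 = numeric(0))
for (i in unique(longanscombe$set)) {   # loop once per set, not once per row
  m <- lm(y ~ x, data = longanscombe %>% filter(set == i))
  B[as.integer(i), ] <- data.frame(as.integer(i), coef(m)[1], coef(m)[2])
}
kable(list(B), caption = "Anscombe's Quartet", booktabs = TRUE, valign = 't', row.names = FALSE)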
Anscombe’s Quartet

| set | b0 | b1 |
|-----|----------|-----------|
| 1 | 3.000091 | 0.5000909 |
| 2 | 3.000909 | 0.5000000 |
| 3 | 3.002454 | 0.4997273 |
| 4 | 3.001727 | 0.4999091 |
ggplot(longanscombe, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~set) +
  geom_smooth(method = "lm", se = FALSE, color = "red")
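The point of the quartet can also be checked numerically. The sketch below is an illustration added here, not part of the original slides; it uses only base R and the built-in `anscombe` data, and the helper name `summarize_set` is hypothetical. It computes the means, variances, and correlation of x and y within each set, which come out nearly identical even though the four scatterplots look nothing alike.

```r
# Summary statistics for each of the four Anscombe sets (base R only).
# `summarize_set` is a hypothetical helper name used for illustration.
summarize_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), var_x = var(x),
             mean_y = mean(y), var_y = var(y),
             cor_xy = cor(x, y))
}

# Stack the four one-row summaries into a single data frame
quartet_stats <- do.call(rbind, lapply(1:4, summarize_set))
print(round(quartet_stats, 2))
```

Summary statistics alone would not reveal the curvature in set 2 or the outliers in sets 3 and 4; the plot does.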
Data Generating Process
What produced the data we observe?
political process, actors, etc.
are those actors purposeful wrt the observed data? That is, do the actors have
data collector; choices, biases, mistakes.
Data Generating Process
Why do we observe the data we see and not the data we don’t?
existence is not randomly determined.
research questions are usually about things that happen, not things that do not or cannot.
reporting itself is a political process.
reporting is shaped by resources
The data generating process is the complete description of how the observed data arose and how other such data would arise. It includes variables, conditionals, functional forms, mappings from one unit to another, etc.
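To make the idea concrete, here is a minimal simulation sketch of a DGP in which reporting depends on resources. Everything in it is an assumption made for illustration: the variable names, the 30% event rate, and the reporting probabilities are invented, not taken from any dataset. Events occur at the same rate everywhere, but better-resourced units are more likely to report them.

```r
# A hypothetical DGP: events occur at a flat rate, but reporting depends on resources.
set.seed(42)                         # arbitrary seed for reproducibility
n <- 5000
resources <- runif(n)                # assumed reporting capacity, scaled 0 to 1
event <- rbinom(n, 1, 0.3)           # true event rate is 30% regardless of resources
p_report <- 0.2 + 0.7 * resources    # assumed: more resources, more reporting
reported <- event * rbinom(n, 1, p_report)  # we only observe reported events

true_rate <- mean(event)
observed_rate <- mean(reported)

# Compare observed event rates for low- and high-resource units
low <- resources < 0.5
rate_by_resources <- c(low = mean(reported[low]), high = mean(reported[!low]))

print(c(true = true_rate, observed = observed_rate))
print(rate_by_resources)
```

In the observed data, events appear more common where resources are high, even though the true rate is flat by construction. The pattern comes from the reporting step of the DGP, not from the process that produces events.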
Thinking about the DGP
What are the units of observation? Who is taking action or having action done to them? Are the units heterogeneous, and if so, how?
What circumstances are the units in?
What are the units’ choice sets?
What relates the units’ circumstances to the outcomes?
How are the units related to each other?
What don’t we observe?
Can the DGP exist?
can the units experience the causal claim in question?
A Terrible Map
does the data represent the entire DGP or just part of it?
are we asking the right question? Are we making Type III errors, that is, finding the right answer to the wrong question? Kennedy (2002, p. 572) writes:
A type III error, introduced in Kimball (1957), occurs when a researcher produces the right answer to the wrong question. A corollary of this rule, as noted by Chatfield (1995, p. 9), is that an approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.
Data
Collections of similar units, their characteristics, features, choice sets, behaviors, etc.
what are the units? In what ways are they heterogeneous?
what units are included? Which ones are missing? Why?
what do the variables measure?
how are the variables measured?
what observations are missing? Why?
what is the sample; what is the population (sampling frame) from which the sample is drawn?
Types of variables
Variables are either
discrete - observations match to integers; all possible values are clearly distinguishable; not divisible. E.g., number of protests in DC this year; an individual’s sex; Polity score.
continuous - observations can take on any real value between boundaries (sometimes \(-\infty, +\infty\)); infinitely divisible. E.g., household income, GDP per capita.
Discrete variables
May be of two types or levels of measurement:
nominal - categories are distinct, but lack order. E.g., religion = Hindu, Muslim, Catholic, Protestant, Jewish. Binary variables are nominal, e.g., Sex = male (0), female (1); do you have blue eyes? yes (0), no (1).
ordinal - take on countable values, increasing/decreasing in some dimension. E.g., Polity -10, -9, \(\ldots\) 0, 1, \(\ldots\) 9, 10 increasing in democracy; survey responses “Do you feel safe traveling abroad?” Not at all; sometimes; yes, completely.
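In R, the nominal/ordinal distinction maps onto unordered and ordered factors. The sketch below is an illustration using the example categories above; the particular response values are made up.

```r
# Nominal: categories with no inherent order
religion <- factor(c("Hindu", "Muslim", "Catholic", "Protestant", "Jewish"))

# Ordinal: categories with a declared order (responses are invented for illustration)
safe_levels <- c("Not at all", "Sometimes", "Yes, completely")
safe <- factor(c("Sometimes", "Not at all", "Yes, completely", "Sometimes"),
               levels = safe_levels, ordered = TRUE)

is.ordered(religion)   # FALSE
is.ordered(safe)       # TRUE

# Order comparisons are defined only for the ordered factor
above_lowest <- safe > "Not at all"
print(above_lowest)

# Counting categories works for either type
print(table(religion))
```

Asking whether one religion is "greater than" another has no meaning, and R reflects that: the same comparison on an unordered factor returns `NA` with a warning.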
Continuous variables
Can be of two types (levels of measurement):
interval - a one-unit increase has the same meaning across the scale (i.e., the intervals are the same); e.g., degrees Celsius or Fahrenheit.
ratio - intervals but also has a meaningful absolute zero; e.g., weight in pounds; zero lbs indicates the absence of weight; Venmo balance = zero, means actually no money; degrees Kelvin. Duration of a war in days - zero days means there’s no war.
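A short numerical sketch of why the zero point matters, using two arbitrary temperatures chosen for illustration. Differences are the same on the Celsius and Kelvin scales, but ratios are only meaningful on Kelvin, where zero means the absence of the quantity.

```r
# Two arbitrary temperatures, in degrees Celsius (interval) and Kelvin (ratio)
celsius <- c(10, 20)
kelvin <- celsius + 273.15

# Differences agree across the two scales (up to floating-point error)
same_difference <- isTRUE(all.equal(diff(celsius), diff(kelvin)))

# Ratios do not: 20 C is not "twice as hot" as 10 C
ratio_c <- celsius[2] / celsius[1]
ratio_k <- kelvin[2] / kelvin[1]

print(same_difference)
print(c(celsius_ratio = ratio_c, kelvin_ratio = ratio_k))
```

The Celsius ratio of 2 is an artifact of where that scale places its zero; the Kelvin ratio of roughly 1.035 is the physically meaningful one.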
Levels of measurement
These four levels of measurement can be ordered by the amount of information a variable contains:
nominal
ordinal
interval
ratio
We can convert a higher level to a lower one, though doing so sacrifices information; we cannot go in the opposite direction.
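As a sketch of what moving down the levels costs, the example below coarsens a ratio-level variable into an ordinal one and then into a binary indicator. The durations and the cut points (30 and 365 days) are invented for illustration.

```r
# Hypothetical war durations in days (ratio level: zero means no war)
duration_days <- c(0, 3, 45, 200, 1500)

# Coarsen to ordinal: order is kept, distances between values are lost.
# The cut points are arbitrary choices for this sketch.
duration_ord <- cut(duration_days,
                    breaks = c(-Inf, 0, 30, 365, Inf),
                    labels = c("none", "short", "medium", "long"),
                    ordered_result = TRUE)

# Coarsen further to a binary indicator: was there any war at all?
any_war <- as.integer(duration_days > 0)

print(data.frame(duration_days, duration_ord, any_war))
```

Wars of 45 and 200 days both land in "medium", and every war of any length becomes a 1 in the binary indicator. Nothing in the coarsened variables lets us recover the original durations.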
Levels of Measurement and Models
In general, the level of measurement of \(y\) (so the type and amount of information in a variable) shapes what type of model is appropriate.
discrete variables usually require statistics/models in the Binomial family (for our purposes, mostly MLE models like the Logit.)
continuous variables usually require statistics/models in the Normal/Gaussian family (for our purposes, mostly OLS models like the linear regression.)
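The sketch below illustrates the pairing of outcome type and model on simulated data. The coefficients, sample size, and seed are arbitrary assumptions. It fits a linear regression to a continuous outcome and a logit to a binary outcome, and for comparison also fits a linear model to the binary outcome.

```r
# Simulated data; all parameter values are assumptions chosen for illustration
set.seed(1)
n <- 1000
x <- rnorm(n)
y_cont <- 1 + 2 * x + rnorm(n)                  # continuous outcome
y_bin <- rbinom(n, 1, plogis(-0.5 + 1.5 * x))   # binary outcome

ols <- lm(y_cont ~ x)                                       # OLS for continuous y
logit <- glm(y_bin ~ x, family = binomial(link = "logit"))  # MLE logit for binary y
lpm <- lm(y_bin ~ x)                                        # linear model on binary y

print(coef(ols))
print(coef(logit))

# Logit predictions are probabilities in (0, 1); linear predictions need not be
print(range(fitted(logit)))
print(range(fitted(lpm)))
```

Both the OLS and logit estimates should land near the values used to simulate the data. The linear model on the binary outcome can produce fitted values below 0 or above 1, which is one reason to prefer a model that matches the outcome's level of measurement.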
Describe these data
Continuous or discrete; what can you say about these data from their observed distribution?
Kennedy, Peter E. 2002. “Sinning in the Basement: What Are the Rules? The Ten Commandments of Applied Econometrics.” Journal of Economic Surveys 16 (4): 569–89.
Lake, David A. 1992. “Powerful Pacifists: Democratic States and War.” American Political Science Review 86 (1): 24–37.